Implement LMDB-based multi-modal cache #30373
petersalas wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces a new LMDB-based multi-modal cache, a significant feature enabling caching across multiple API server workers or engine processes on the same machine. The implementation is well-designed, featuring a dedicated evictor process with an adaptive strategy, object chunking to prevent fragmentation, and transaction management to ensure data consistency while minimizing lock contention. The changes are well-integrated into the existing caching framework, and the new functionality is accompanied by a solid set of tests. I've identified one minor issue regarding unreachable code, which is detailed in a specific comment. Overall, this is a high-quality and well-executed contribution.
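The PR itself doesn't show the chunking code in this conversation, but the general idea the review describes — splitting a large cached object into fixed-size chunks stored under derived keys so no single record fragments the store — can be sketched in pure Python. Everything below is hypothetical: the dict stands in for the real LMDB environment, and the key scheme, `CHUNK_SIZE`, and function names are invented for illustration.

```python
# Hypothetical sketch of "object chunking to prevent fragmentation":
# a large value is split into fixed-size chunks stored under derived
# keys ("<key>:<index>"), plus a small metadata entry with the chunk
# count. The dict stands in for the real LMDB environment.

CHUNK_SIZE = 4  # tiny for illustration; a real cache would use e.g. 1 MiB

def put_chunked(store, key, value):
    chunks = [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]
    store[f"{key}:meta"] = len(chunks)  # record how many chunks to read back
    for idx, chunk in enumerate(chunks):
        store[f"{key}:{idx}"] = chunk

def get_chunked(store, key):
    n = store.get(f"{key}:meta")
    if n is None:
        return None  # cache miss
    return b"".join(store[f"{key}:{i}"] for i in range(n))

store = {}
put_chunked(store, "img", b"0123456789")
assert get_chunked(store, "img") == b"0123456789"
assert get_chunked(store, "missing") is None
```

In a real LMDB-backed store this matters because very large values churned through an append/evict workload tend to fragment the memory map; fixed-size chunks keep freed pages reusable.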
Hi @petersalas, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
DarkLight1337 left a comment:

Some very initial comments
This pull request has merge conflicts that must be resolved before it can be merged.
Heads-up that I have implemented the refactor to the cache factories in #32382, and have added you as co-author.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: Peter Salas <peter@fixie.ai>
As one might reasonably expect, the LMDB cache is marginally slower than the shm cache in a microbenchmark, but in exchange it supports API-server scale-out, which can significantly improve tail latency in multi-modal-heavy inference scenarios.
Asked @ywang96 to take a look since I already did a pass earlier
Purpose
This implements an LMDB-based multi-modal item cache which supports LRU-based eviction, multiple API server workers, and/or multiple Engine processes.
`BaseMultiModalProcessorCache`/`BaseMultiModalReceiverCache` now include a `begin()` method which (for the LMDB implementation) uses a transaction. In the sender-side cache, writes are queued outside of the transaction scope so that processing/serialization all occur outside of the scope of the write transaction (LMDB has single-writer semantics).

One caveat with this implementation is that any item that hasn't been used within a fixed time window (`--mm-lmdb-cache-min-eviction-age`, defaults to 600 seconds) may be evicted, even if there are queued requests still depending on those items (the worker's `execute_model` will raise in that case). A future improvement could be to instead track the oldest active request in each of the frontends (`OutputProcessor` seemingly already has this) and use that instead.

Test Plan
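The write pattern and eviction caveat described above can be illustrated with a toy, pure-Python analogue. This is a sketch under stated assumptions, not the PR's implementation: a `threading.Lock` stands in for LMDB's single-writer lock, a dict stands in for the database, and `SketchCache`, `queue_put`, and `evict_older_than` are invented names. The key points it demonstrates are that serialization happens before the write "transaction" is entered, and that anything unused for longer than the minimum eviction age is fair game.

```python
# Toy analogue of the described LMDB cache; all names are invented.
import pickle
import threading
import time
from contextlib import contextmanager

class SketchCache:
    def __init__(self):
        self._store = {}          # stand-in for the LMDB database
        self._last_used = {}      # per-key last-access timestamps
        self._writer_lock = threading.Lock()  # stand-in for LMDB's single writer
        self._pending = []        # writes queued outside the transaction

    def queue_put(self, key, item):
        # Serialize *before* the writer lock is taken, mirroring
        # "writes are queued outside of the transaction scope".
        self._pending.append((key, pickle.dumps(item)))

    @contextmanager
    def begin(self):
        # Analogue of the begin() method: hold the write "transaction"
        # only long enough to flush already-serialized writes.
        with self._writer_lock:
            yield
            now = time.monotonic()
            for key, blob in self._pending:
                self._store[key] = blob
                self._last_used[key] = now
            self._pending.clear()

    def get(self, key):
        blob = self._store.get(key)
        if blob is None:
            return None
        self._last_used[key] = time.monotonic()
        return pickle.loads(blob)

    def evict_older_than(self, min_age_s):
        # Anything unused for min_age_s (cf. --mm-lmdb-cache-min-eviction-age)
        # may be evicted, even if a queued request still depends on it.
        cutoff = time.monotonic() - min_age_s
        for key in [k for k, t in self._last_used.items() if t < cutoff]:
            del self._store[key], self._last_used[key]

cache = SketchCache()
cache.queue_put("mm_item", {"pixels": [1, 2, 3]})
with cache.begin():
    pass  # reads could happen here; queued writes flush on exit
assert cache.get("mm_item") == {"pixels": [1, 2, 3]}
cache.evict_older_than(-1.0)  # negative age forces eviction in this toy
assert cache.get("mm_item") is None
```

The real implementation has to do more (an evictor process, chunked values, transaction retries), but the shape — expensive work outside the writer lock, cheap puts inside it, age-based eviction independent of in-flight requests — matches the description in the PR.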
Tested across permutations of API/Tensor/Data parallelism.
(Happy to run any suggested benchmarks as well!)
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.